Alignment training


PLLuM: A Family of Polish Large Language Models

Kocoń, Jan, Piasecki, Maciej, Janz, Arkadiusz, Ferdinan, Teddy, Radliński, Łukasz, Koptyra, Bartłomiej, Oleksy, Marcin, Woźniak, Stanisław, Walkowiak, Paweł, Wojtasik, Konrad, Moska, Julia, Naskręt, Tomasz, Walkowiak, Bartosz, Gniewkowski, Mateusz, Szyc, Kamil, Motyka, Dawid, Banach, Dawid, Dalasiński, Jonatan, Rudnicka, Ewa, Alberski, Bartłomiej, Walkowiak, Tomasz, Szczęsny, Aleksander, Markiewicz, Maciej, Bernaś, Tomasz, Mazur, Hubert, Żyta, Kamil, Tykierko, Mateusz, Chodak, Grzegorz, Kajdanowicz, Tomasz, Kazienko, Przemysław, Karlińska, Agnieszka, Seweryn, Karolina, Kołos, Anna, Chrabąszcz, Maciej, Lorenc, Katarzyna, Krasnodębska, Aleksandra, Wilczek, Artur, Dziewulska, Katarzyna, Betscher, Paula, Cieślińska, Zofia, Kowol, Katarzyna, Mikoś, Daria, Trzciński, Maciej, Krutul, Dawid, Kozłowski, Marek, Dadas, Sławomir, Poświata, Rafał, Perełkiewicz, Michał, Grębowiec, Małgorzata, Kazuła, Maciej, Białas, Marcin, Roszko, Roman, Roszko, Danuta, Vaičenonienė, Jurgita, Utka, Andrius, Levchuk, Paweł, Kowalski, Paweł, Prawdzic-Jankowska, Irena, Ogrodniczuk, Maciej, Borys, Monika, Bulińska, Anna, Gumienna, Wiktoria, Kieraś, Witold, Komosińska, Dorota, Krasnowska-Kieraś, Katarzyna, Kobyliński, Łukasz, Lewandowska, Martyna, Łaziński, Marek, Łątkowski, Mikołaj, Mastalerz, Dawid, Milewicz, Beata, Mykowiecka, Agnieszka Anna, Peljak-Łapińska, Angelika, Penno, Sandra, Przybysz, Zuzanna, Rudolf, Michał, Rybak, Piotr, Saputa, Karolina, Tomaszewska, Aleksandra, Wawer, Aleksander, Woliński, Marcin, Wołoszyn, Joanna, Wróblewska, Alina, Żuk, Bartosz, Żarnecki, Filip, Kaczyński, Konrad, Cichosz, Anna, Deckert, Zuzanna, Garnys, Monika, Grabarczyk, Izabela, Janowski, Wojciech, Karasińska, Sylwia, Kujawiak, Aleksandra, Misztela, Piotr, Szymańska, Maria, Walkusz, Karolina, Siek, Igor, Kwiatkowski, Jakub, Pęzik, Piotr

arXiv.org Artificial Intelligence

Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instruction dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.


AI Agents and the Law

Riedl, Mark O., Desai, Deven R.

arXiv.org Artificial Intelligence

As AI becomes more "agentic," it faces technical and socio-legal issues it must address if it is to fulfill its promise of increased economic productivity and efficiency. This paper uses technical and legal perspectives to explain how things change when AI systems start being able to directly execute tasks on behalf of a user. We show how technical conceptions of agents track some, but not all, socio-legal conceptions of agency. That is, both computer science and the law recognize the problems of under-specification for an agent, and both disciplines have robust conceptions of how to ensure that an agent does what the programmer (or, in the law, the principal) desires and no more. However, to date, computer science has under-theorized issues related to questions of loyalty and to third parties that interact with an agent, both of which are central parts of the law of agency. First, we examine the correlations between implied authority in agency law and the principle of value-alignment in AI, wherein AI systems must operate under imperfect objective specification. Second, we reveal gaps in the current computer science view of agents pertaining to the legal concepts of disclosure and loyalty, and how failure to account for them can result in unintended effects in AI e-commerce agents. In surfacing these gaps, we show a path forward for responsible AI agent development and deployment.


Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection

Zhao, Yachao, Wang, Bo, Wang, Yan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. While extensive research has investigated bias in LLMs, prior work has predominantly focused on explicit bias, leaving the more nuanced implicit biases largely unexplored. This paper presents a systematic framework grounded in social psychology theories to investigate and compare explicit and implicit biases in LLMs. We propose a novel "self-reflection" based evaluation framework that operates in two phases: first measuring implicit bias through simulated psychological assessment methods, then evaluating explicit bias by prompting LLMs to analyze their own generated content. Through extensive experiments on state-of-the-art LLMs across multiple social dimensions, we demonstrate that LLMs exhibit a substantial inconsistency between explicit and implicit biases, where explicit biases manifest as mild stereotypes while implicit biases show strong stereotypes. Furthermore, we investigate the underlying factors contributing to this explicit-implicit bias inconsistency. Our experiments examine the effects of training data scale, model parameters, and alignment techniques. Results indicate that while explicit bias diminishes with increased training data and model size, implicit bias exhibits a contrasting upward trend. Notably, contemporary alignment methods (e.g., RLHF, DPO) effectively suppress explicit bias but show limited efficacy in mitigating implicit bias. These findings suggest that while scaling up models and alignment training can address explicit bias, the challenge of implicit bias requires novel approaches beyond current methodologies.
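
The two-phase protocol described above lends itself to a compact sketch. Below is a minimal, hedged illustration of the idea (not the authors' released code): phase one probes implicit associations without ever mentioning bias, phase two asks the model to reflect on its own output. The `generate` callable and the exact prompt wordings are assumptions.

```python
# Minimal sketch of a two-phase explicit/implicit bias probe, inspired by the
# abstract above. Prompts and the `generate` hook are illustrative assumptions.

def generate(prompt: str) -> str:
    """Placeholder for any chat-completion call (local or API-based)."""
    raise NotImplementedError("plug in your LLM call here")

def measure_implicit_bias(group_a: str, group_b: str, attributes: list[str]) -> dict:
    """Phase 1: an association-style test that never mentions bias explicitly."""
    associations = {}
    for attr in attributes:
        prompt = (
            f"Complete the sentence with either '{group_a}' or '{group_b}':\n"
            f"The word '{attr}' fits best with ___."
        )
        associations[attr] = generate(prompt).strip()
    return associations

def measure_explicit_bias(generated_text: str) -> str:
    """Phase 2: self-reflection, asking the model to critique its own output."""
    prompt = (
        "Review the following text you produced and state whether it reflects "
        f"any social stereotype, and if so which one:\n\n{generated_text}"
    )
    return generate(prompt)
```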


Hybrid Alignment Training for Large Language Models

Wang, Chenglong, Zhou, Hang, Chang, Kaiyan, Li, Bei, Mu, Yongyu, Xiao, Tong, Liu, Tongran, Zhu, Jingbo

arXiv.org Artificial Intelligence

Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed in two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and the LLMs cannot be guaranteed to align well with both the instructions and human preferences simultaneously. In response, we propose a Hybrid Alignment Training (Hbat) approach based on alternating alignment and a modified elastic weight consolidation method. The basic idea is to alternate between the different objectives during alignment training, so that better collaboration can be achieved between the two alignment tasks. We experiment with Hbat on summarization and dialogue tasks. Experimental results show that the proposed Hbat can significantly outperform all baselines. Notably, Hbat yields consistent performance gains over the traditional two-stage alignment training when using both proximal policy optimization and direct preference optimization.
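
As a rough sketch of the alternating idea, the training loop below switches between an instruction-following loss and a preference loss, while an elastic-weight-consolidation-style penalty keeps parameters close to a reference checkpoint. The loss functions, Fisher estimate, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of alternating alignment objectives with an EWC-style penalty.
import torch

def ewc_penalty(model, ref_params, fisher, lam=0.1):
    """Quadratic penalty keeping parameters near a reference checkpoint,
    weighted by an (approximate) Fisher information estimate."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in ref_params:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty

def hybrid_step(model, sft_batch, pref_batch, step,
                sft_loss_fn, pref_loss_fn, ref_params, fisher, optimizer):
    """Alternate between instruction-following and preference objectives,
    regularized so one objective does not overwrite the other."""
    optimizer.zero_grad()
    if step % 2 == 0:
        loss = sft_loss_fn(model, sft_batch)    # instruction-following alignment
    else:
        loss = pref_loss_fn(model, pref_batch)  # human-preference alignment (e.g., DPO)
    loss = loss + ewc_penalty(model, ref_params, fisher)
    loss.backward()
    optimizer.step()
    return loss.item()
```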


ReMoDetect: Reward Models Recognize Aligned LLM's Generations

Lee, Hyunseok, Tack, Jihoon, Shin, Jinwoo

arXiv.org Artificial Intelligence

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preference scores than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of human/LLM mixed texts (human-written texts rephrased by aligned LLMs), which serve as a corpus of intermediate preference between LGTs and human-written texts and help the model learn the decision boundary better. We provide an extensive evaluation covering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results.
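
The core detection idea, scoring a passage with a preference reward model and flagging unusually high scores, can be sketched in a few lines. The checkpoint name and threshold below are placeholders, not the paper's released artifacts.

```python
# Minimal sketch: aligned LLMs tend to produce higher-reward text than humans,
# so a calibrated reward-score threshold separates the two. The model name is
# a hypothetical placeholder; substitute any preference reward model that loads
# with AutoModelForSequenceClassification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REWARD_CKPT = "your-org/preference-reward-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(REWARD_CKPT)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_CKPT)

def reward_score(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

def looks_llm_generated(text: str, threshold: float) -> bool:
    # The threshold would be calibrated on held-out human-written text.
    return reward_score(text) > threshold
```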


Weak-to-Strong Extrapolation Expedites Alignment

Zheng, Chujie, Wang, Ziqi, Ji, Heng, Huang, Minlie, Peng, Nanyun

arXiv.org Artificial Intelligence

The open-source community is experiencing a surge in the release of large language models (LLMs) that are trained to follow instructions and align with human preference. However, further training to improve them still requires expensive computational resources and data annotations. Is it possible to bypass additional training and cost-effectively acquire better-aligned models? Inspired by the literature on model interpolation, we propose a simple method called ExPO to boost LLMs' alignment with human preference. Utilizing a model that has undergone alignment training (e.g., via DPO or RLHF) and its initial SFT checkpoint, ExPO directly obtains a better-aligned model by extrapolating from the weights of the initial and the aligned models, which implicitly optimizes the alignment objective via first-order approximation. Through experiments with twelve open-source LLMs on HuggingFace, we demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models, as evaluated on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench. Moreover, ExPO exhibits remarkable scalability across various model sizes (from 1.8B to 70B) and capabilities. Through controlled experiments and further empirical analyses, we shed light on the essence of ExPO: amplifying the reward signal learned during alignment training. Our work demonstrates the efficacy of model extrapolation in expediting the alignment of LLMs with human preference, suggesting a promising direction for future research.
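
Because ExPO operates purely on model weights, the mechanics fit in a short function: move further along the direction from the SFT checkpoint toward the aligned checkpoint. The checkpoint names and the extrapolation coefficient below are assumptions for illustration.

```python
# Hedged sketch of weight extrapolation in the spirit of ExPO:
# theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft).
import torch
from transformers import AutoModelForCausalLM

def extrapolate(sft_model, aligned_model, alpha: float = 0.3):
    sft_state = sft_model.state_dict()
    expo_state = {}
    for name, theta_aligned in aligned_model.state_dict().items():
        if theta_aligned.is_floating_point():
            expo_state[name] = theta_aligned + alpha * (theta_aligned - sft_state[name])
        else:
            expo_state[name] = theta_aligned  # leave integer buffers untouched
    aligned_model.load_state_dict(expo_state)
    return aligned_model

# Usage (model names are placeholders):
# sft = AutoModelForCausalLM.from_pretrained("org/model-sft")
# dpo = AutoModelForCausalLM.from_pretrained("org/model-dpo")
# expo = extrapolate(sft, dpo, alpha=0.3)
```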


The Flaw That Could Ruin Generative AI

The Atlantic - Technology

And because an LLM doesn't "know" when it's quoting from training data, there's no obvious way to prevent the behavior. I spoke with Florian Tramèr, a prominent AI-security researcher and co-author of some of the above studies. It's "an extremely tricky problem to study," he told me. "It's very, very hard to pin down a good definition of memorization." One way to understand the concept is to think of an LLM as an enormous decision tree in which each node is an English word. From a given starting word, an LLM chooses the next word from the entire English vocabulary.
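
The "decision tree" picture sketched here corresponds to ordinary next-token sampling: at each step the model assigns a probability to every word in its vocabulary and picks one. The toy example below (made-up vocabulary and scores) illustrates why memorized passages get reproduced: when one continuation dominates the distribution, sampling almost always follows that branch.

```python
# Toy illustration of next-word selection; vocabulary and scores are invented.
import torch

def next_word(logits: torch.Tensor, vocab: list[str], temperature: float = 1.0) -> str:
    probs = torch.softmax(logits / temperature, dim=-1)
    idx = torch.multinomial(probs, num_samples=1).item()
    return vocab[idx]

vocab = ["the", "cat", "sat", "on", "mat"]
logits = torch.tensor([0.1, 0.2, 8.0, 0.1, 0.3])  # "sat" overwhelmingly dominates
print(next_word(logits, vocab))  # almost always prints "sat"
```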


Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Pisano, Matthew, Ly, Peter, Sanders, Abraham, Yao, Bingsheng, Wang, Dakuo, Strzalkowski, Tomek, Si, Mei

arXiv.org Artificial Intelligence

Modern large language models (LLMs) can still generate responses that may not be aligned with human expectations or values. While many weight-based alignment methods have been proposed, they often still leave models vulnerable to attacks when used on their own. To help mitigate this issue, we introduce Bergeron, a framework designed to improve the robustness of LLMs against adversarial attacks. Bergeron employs a two-tiered architecture in which a secondary LLM serves as a simulated conscience that safeguards a primary LLM. It does this by monitoring for and correcting potentially harmful text within both the prompt inputs and the generated outputs of the primary LLM. Empirical evaluation shows that Bergeron can improve the alignment and robustness of several popular LLMs without costly fine-tuning. It aids both open-source and black-box LLMs by complementing and reinforcing their existing alignment training.
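
A minimal sketch of the two-tier pattern described above, assuming generic text-in/text-out callables for the primary model and the "conscience"; the screening prompts are illustrative, not Bergeron's actual templates.

```python
# Hedged sketch of a conscience-style wrapper: screen the prompt on the way in
# and the response on the way out. `primary` and `conscience` are any LLM calls.

def respond_safely(prompt: str, primary, conscience) -> str:
    # Tier 1: screen the incoming prompt.
    verdict = conscience(
        f"Is the following request harmful? Answer YES or NO.\n\n{prompt}"
    )
    if verdict.strip().upper().startswith("YES"):
        return "I can't help with that request."

    # Primary model answers as usual.
    draft = primary(prompt)

    # Tier 2: screen and, if needed, correct the draft response.
    return conscience(
        "If the following response contains harmful content, rewrite it to be "
        f"safe; otherwise return it unchanged.\n\n{draft}"
    )
```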